Metadata Consistency Model of GFS

Learn how GFS provides strong consistency for metadata mutations.

Metadata#

The manager node stores all of the metadata in memory and serves user requests from there for good performance. Some of the metadata is stored persistently on the manager node's hard disk and also replicated on remote machines, while some is not, as shown in the following illustration. (The metadata that is not persistently stored can be rebuilt if needed; such data is sometimes called soft state.)

Metadata stored in the manager's memory vs. metadata stored persistently on the hard disk

Note: In February 2017, AWS's S3 storage system suffered an outage. As part of the recovery, some subsystems needed a full restart, which took many hours. One reason for this delay was that soft state had to be rebuilt. Since such restarts are uncommon for services like S3 (which often haven't been restarted in years), designers might forget how large their systems have grown over the years and how that growth affects restart time. As designers, we need to be mindful of the total size of the system state and how long it could take to rebuild.

The location of each chunk's replicas doesn't need to be stored permanently on the manager node because the manager can determine this information from chunkservers through heartbeat messages. The chunkservers share their state and the chunks they hold with the manager.

Metadata such as namespaces and file-to-chunk mappings exists only at the manager level. We can't recover this metadata from anywhere else if it is lost due to a failure and restart of the manager node. The manager therefore checkpoints this metadata and records changes to it in an operation log stored on its hard disk. On restart, the manager loads the latest checkpoint and replays the operation log to bring up-to-date namespaces and file-to-chunk mappings back into memory.
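The recovery path described above can be sketched in a few lines of Python. This is a simplified illustration, not GFS code; the class, the `recover` method, and the log record shapes are all hypothetical names chosen for this example.

```python
class ManagerMetadata:
    """In-memory file-to-chunk mappings (hypothetical sketch, not GFS code)."""

    def __init__(self):
        self.file_to_chunks = {}   # path -> list of chunk handles

    def apply(self, op):
        # Each operation-log record describes one metadata mutation.
        if op["type"] == "create":
            self.file_to_chunks[op["path"]] = []
        elif op["type"] == "add_chunk":
            self.file_to_chunks[op["path"]].append(op["chunk"])

    def recover(self, checkpoint, log_records):
        # 1) Load the last checkpoint into memory...
        self.file_to_chunks = dict(checkpoint)
        # 2) ...then replay every mutation logged after that checkpoint.
        for op in log_records:
            self.apply(op)

# Usage: rebuild state from a checkpoint plus two later log records.
m = ManagerMetadata()
m.recover({"/a": ["c1"]},
          [{"type": "create", "path": "/b"},
           {"type": "add_chunk", "path": "/b", "chunk": "c7"}])
```

After `recover`, the in-memory state reflects the checkpoint plus all replayed mutations, which is exactly why the log only needs to hold records written since the last checkpoint.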

Shadow managers#

To recover from long-lasting or permanent failures of the manager node, GFS replicates the checkpoints and the operation log on remote machines, and it also maintains shadow managers. Clients use shadow managers only when they cannot reach the original (primary) manager, and only for queries that don't change the metadata, i.e., read requests. A shadow manager might lag slightly behind the primary: it periodically reads the primary's operation log from remote persistent storage and applies the changes to its copy of the metadata. During this window, clients that use a shadow manager may get stale metadata. The primary manager synchronously replicates its checkpoint data and operation log to the remote storage.

We know that when we have replicas of something, there is always a chance of inconsistency among them. In the case of a single manager, we might lose some important metadata changes if the manager node fails before propagating metadata changes to the replicas. GFS ensures there won’t be inconsistencies in the metadata by implementing synchronous replication of the metadata changes. Let's see how synchronous replication works.

Point to ponder

Question

How do clients find out where a new manager node is after a possible failure, and how can a client reach the shadow managers?

Answer

We can use the following DNS records:

  • Primary manager 10.1.1.1
  • Shadow manager1 10.1.1.2
  • Shadow manager2 10.1.1.3

When a new server is designated as a primary manager, we can update the record in the DNS. Clients can find current primary and shadow managers using the DNS.
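The lookup-and-fallback behavior described above can be sketched as follows. The addresses come from the DNS records listed above; the function name and the `reachable` set are hypothetical, standing in for real DNS resolution and health checks.

```python
# Hypothetical client-side selection: try the primary first, and fall
# back to a shadow manager only for read-only queries.
MANAGERS = {
    "primary": "10.1.1.1",
    "shadows": ["10.1.1.2", "10.1.1.3"],
}

def pick_manager(reachable, read_only):
    """Return the address of a manager the client can use, or None."""
    if MANAGERS["primary"] in reachable:
        return MANAGERS["primary"]
    if read_only:  # shadows serve reads only, possibly with stale metadata
        for addr in MANAGERS["shadows"]:
            if addr in reachable:
                return addr
    return None  # metadata mutations must wait for a new primary

# Primary down: a read can use a shadow, but a write cannot proceed.
print(pick_manager({"10.1.1.2"}, read_only=True))   # -> 10.1.1.2
print(pick_manager({"10.1.1.2"}, read_only=False))  # -> None
```

When a failed primary is replaced, only the DNS record needs to change; clients keep using the same lookup logic.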

Synchronous replication#

In synchronous replication, we respond to a client's request only after the metadata change has been logged both in the operation log on the manager's hard disk and on all the replicas. Only then can the client see the changes. The steps below show how synchronous replication takes place.

1. There is a GFS client, a manager node that holds all of the metadata in memory, and a remote storage machine that keeps a replica of the metadata.
2. The manager node keeps a persistent record of the metadata (operation log and checkpoints) on its local disk, and the same is replicated on the remote machine.
3. The client makes a request to the manager that causes a change in the metadata.
4. The primary manager logs the metadata change operation in the operation log on its hard disk.
5. The primary manager propagates the metadata change to the remote machine holding a replica of the operation log and checkpoints.
6. The remote machine logs the metadata change operation to its operation log.
7. The remote machine sends an acknowledgment to the primary manager that it has logged the operation successfully.
8. The primary manager applies the metadata change in memory.
9. Since the metadata change is successfully logged everywhere, the primary manager responds to the client's request with success.
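The sequence above can be condensed into a short sketch. The class names and the in-memory dictionary are illustrative assumptions; the point is the ordering: log locally, wait for every replica's acknowledgment, and only then apply in memory and answer the client.

```python
class RemoteReplica:
    """Remote machine holding a replica of the operation log (sketch)."""
    def __init__(self):
        self.log = []

    def append(self, op):
        self.log.append(op)
        return True  # acknowledgment back to the primary

class PrimaryManager:
    def __init__(self, replicas):
        self.local_log = []
        self.metadata = {}          # in-memory metadata
        self.replicas = replicas

    def mutate(self, op):
        # Step 4: log the change on the local disk first.
        self.local_log.append(op)
        # Steps 5-7: replicate and wait for every acknowledgment.
        if not all(r.append(op) for r in self.replicas):
            return "error"
        # Step 8: apply in memory only after the op is durable everywhere.
        self.metadata[op["path"]] = op["value"]
        # Step 9: only now report success to the client.
        return "ok"

mgr = PrimaryManager([RemoteReplica()])
print(mgr.mutate({"path": "/f", "value": "chunk-list"}))  # -> ok
```

Because success is reported only after every copy of the log holds the record, a crash at any earlier step can never leave the client believing a change was durable when it wasn't.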

Failures can occur on the manager node. Let’s see how the manager handles the client’s operations when it fails while performing a metadata change operation.

  • If the manager node fails before writing a metadata change operation into the operation log, the client should retry the operation. The manager node will be able to perform it once it recovers. We can add this retry logic to the GFS client so that manager failures are not surfaced to end users.
  • If the manager node fails after writing the metadata change operation into the log but before applying the change, the operation will still be in the operation log and will be replayed on the manager node's recovery. After reconnecting with the manager, the client will see that the metadata changes were made successfully, so it doesn't need to retry the operation.
  • If the manager node fails after writing the metadata change operation to the log and applying it, but before returning success to the client, the client can see on reconnection that the changes it requested are there; it should be smart enough not to request the same operation again. (Nevertheless, most of the API calls a client can make return an appropriate error if redone, for example, creating a file twice with the same name at the same location.)
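The third case above hinges on the client treating an "already done" error on retry as success. A minimal sketch of that client-side logic, using hypothetical names (`create_file`, `FlakyManager`) and Python's built-in exceptions to stand in for RPC errors:

```python
# Hypothetical client wrapper: retry metadata ops on manager failure,
# and treat an "already exists" error on the retry as success.
def create_file(manager, path, retries=3):
    for _ in range(retries):
        try:
            manager.create(path)
            return "ok"
        except FileExistsError:
            # A previous attempt was logged and applied before the
            # manager failed, so the operation actually succeeded.
            return "ok"
        except ConnectionError:
            continue  # manager down: retry after it recovers
    return "error"

class FlakyManager:
    """Fails once mid-operation: the op is applied, then the reply is lost."""
    def __init__(self):
        self.files = set()
        self.failed_once = False

    def create(self, path):
        if path in self.files:
            raise FileExistsError(path)
        self.files.add(path)            # logged and applied...
        if not self.failed_once:
            self.failed_once = True
            raise ConnectionError()     # ...but success never reaches client

print(create_file(FlakyManager(), "/a"))  # -> ok
```

The retry sees `FileExistsError` and correctly concludes the first attempt went through, so the failure stays invisible to the end user.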

Summary#

For metadata, GFS provides a strong consistency model so that the system continues to work normally in case of a single manager's failure. We might rightfully think that synchronous replication of each log record hurts the system's throughput and adds latency. That is a performance tradeoff we must pay to achieve strong consistency in the face of failures.

To reduce the impact of flushing and replicating each individual log record on overall throughput, GFS batches several log records together. While batching improves throughput, all of the clients in a batch will get an error if the manager node fails before it has a chance to push the batch out; all requests in the batch must then be retried. This is an example of a tradeoff between throughput and latency: because IO operations are slow, batching many requests together reduces the trips we take to remote storage and increases the throughput of storing and syncing metadata.
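A small sketch makes the tradeoff concrete. The classes below are hypothetical; the key property is that one remote round trip makes a whole batch of records durable, and every client in the batch shares the batch's fate.

```python
# Hypothetical batched log: buffer records, flush them in one IO trip,
# then answer every waiting client in the batch at once.
class BatchedLog:
    def __init__(self, remote, batch_size=3):
        self.pending = []            # (record, client_inbox) awaiting flush
        self.remote = remote
        self.batch_size = batch_size

    def append(self, record, client_inbox):
        self.pending.append((record, client_inbox))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        batch = [rec for rec, _ in self.pending]
        ok = self.remote.append_all(batch)   # one trip for many records
        # Every client in the batch succeeds or must retry together.
        for _, inbox in self.pending:
            inbox.append("ok" if ok else "retry")
        self.pending.clear()

class RemoteLog:
    def __init__(self):
        self.records = []
        self.trips = 0               # count round trips to remote storage

    def append_all(self, batch):
        self.trips += 1
        self.records.extend(batch)
        return True

remote = RemoteLog()
log = BatchedLog(remote)
clients = [[] for _ in range(3)]
for i, inbox in enumerate(clients):
    log.append({"op": i}, inbox)
print(remote.trips)  # -> 1 (three records, one trip to remote storage)
```

Each record waits until the batch fills, which adds latency for the first clients in the batch, but three records cost one IO trip instead of three.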
